Dynamics of text generation with realistic Zipf distribution 
We investigate the origin of Zipf's law for words in written texts by means of a stochastic dynamical model for text generation. The model incorporates both features related to the general structure of languages and memory effects inherent to the production of long coherent messages in the communication process. It is shown that the multiplicative dynamics of our model leads to rank-frequency distributions in quantitative agreement with empirical data. Our results give support to the linguistic relevance of Zipf's law in human language.



Natural languages are complex systems that have evolved as effective dynamical structures, capable of codifying and transmitting highly nontrivial information. The foremost example of the kind is the genetic code, which evolved as a very efficient way of coding instructions to replicate living things. Another example of a natural code is human language. Although it evolved on shorter time scales than the genetic code and was subject to a noisier environment, it also achieved its present complexity by the natural process of evolution. In recent years we have witnessed a gradual extension of the ideas and methods of statistical physics onto a vast range of complex phenomena outside the traditional boundaries of physical science. When language is studied with tools originally developed within statistical mechanics, a very rich structure emerges at levels ranging from simple word counts to large-scale organizational patterns that span many thousands of words. The use of statistics applied to human language records draws a picture of its macroscopic structure, at a level of description that may disclose traces of its evolutionary history and provide information on the complex processes behind its generation by the brain.

One of the most generic statistical properties of written language is established by Zipf's law, in the form of a mathematical relation between the frequency of each word and its rank in a list of all the words used in a text, ordered by decreasing frequency. Defining the normalized frequency of a word in a text as $f(r) = n(r)/T$, where $n(r)$ is the number of occurrences of the word at rank $r$ in the frequency-ordered list and $T$ is the total number of words in the text, Zipf's law reads

$$ f(r) \propto r^{-z}. \tag{1} $$

In the original formulation of this empirical law, the exponent $z$ was taken to be exactly equal to one. If, instead, the exponent is treated as a parameter and fitted to empirical data, it may take on values different from unity.
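
As an illustration of how $f(r)$ and the exponent $z$ can be measured in practice, the following minimal Python sketch builds the rank-frequency list of a text and estimates $z$ by a least-squares fit in log-log scale. The function names, the tokenization rule (lowercased runs of letters), and the choice of fitting window are assumptions made for this example; they are not taken from the analysis reported in this paper.

```python
import re
from collections import Counter

import numpy as np


def rank_frequency(text):
    """Return normalized frequencies f(r) = n(r)/T, sorted by decreasing count."""
    # Assumption: a "word" is any run of letters, compared case-insensitively.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    return counts / counts.sum()


def fit_zipf_exponent(freqs, r_min=1, r_max=None):
    """Estimate z as minus the slope of log f(r) versus log r on ranks r_min..r_max."""
    r_max = len(freqs) if r_max is None else min(r_max, len(freqs))
    ranks = np.arange(r_min, r_max + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs[r_min - 1:r_max]), 1)
    return -slope
```

For a text held in a string `text`, a call such as `fit_zipf_exponent(rank_frequency(text), 20, 1000)` restricts the fit to an intermediate window of ranks, which may be preferable when the most frequent and the rarest words deviate from a pure power law.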

More than fifty years after the discovery of Zipf's law in human language, its ultimate origin remains elusive. However, the remarkable ubiquity of this law over diverse languages suggests that its origin must be looked for in very general probabilistic aspects associated with the process of language generation. Two models explaining Zipf's law are worth mentioning, which represent two very different positions with respect to the linguistic significance of the law. The first one, a stochastic process put forward by H. Simon, simulates the dynamics of text generation as a multiplicative process that leads to Zipf's law for asymptotically long texts. The second model is due to B. Mandelbrot, who explained Zipf's law as a static feature of the statistical structure of random symbolic sequences. While Simon's model gives Zipf's law a nontrivial linguistic significance, Mandelbrot's explanation renders the law as a mere property of a random array of symbols. Let us also recall that, recently, it has been proposed that Zipf's law could be the result of Markov processes underlying text-generation dynamics. However, the analytical results were only tested against the text of the Bible, which is in fact a large collection of relatively short writings and whose statistics would therefore be subject to strong random fluctuations.

In this paper, we present a dynamical model that explains the empirical behavior of word frequencies as a consequence of two processes acting at different scales: a global memory effect driven by context, essentially related to the interplay between multiplicative and additive dynamics in word selection, and a local grammar-dependent effect associated with the appearance of word inflections. This model is able to reproduce realistic Zipf distributions and, additionally, has a linguistic interpretation. We show that the controversy between Simon's and Mandelbrot's views may be resolved by explaining the particular deviations that empirical distributions exhibit with respect to the original form of Zipf's law.

As advanced above, empirical rank-frequency distributions from real text sources are more complex than the original form of Zipf's law, Eq. (1). The three uppermost curves in Fig. 1 show, from top to bottom, the rank-frequency distributions of words for three literary works written in English, Spanish and Latin, respectively. In all three cases the qualitative features of the distributions are similar, but some quantitative differences are worth mentioning. Beyond a short range corresponding to the most frequent words, where the behavior of $f(r)$ is rather irregular, the distributions decay as a power law in an interval of ranks spanning, for these texts, a little less than two decades. Immediately following the power-law regime, the distributions fall slightly faster, as can be seen by comparing with the pure power laws shown in the graph as dotted lines. Note that the values of the exponent $z$ measured in the power-law regime deviate appreciably from unity in the English and Latin texts. This phenomenology seems to be generic, since consistent quantitative results are found for other works in the same languages.

FIG. 1: From top to bottom, rank-frequency distributions from David Copperfield by Charles Dickens, Don Quijote de la Mancha by Miguel de Cervantes, and Æneid by Virgil, respectively. The fourth distribution represents a realization of Simon's model. The dotted lines stand for pure power laws, $r^{-z}$, fitting the distributions in the intermediate interval of ranks ($20 < r < 10^3$). For clarity, the curves are displaced in the vertical scale.

Simon's model intends to capture the essential features of actual text generation by specifying how words are added to the text, as follows. Suppose that one word is added at each step, beginning with a single word at $t = 1$, so that at any step $t$ the length of the text is $t$. With a fixed probability $\alpha$, the word added at step $t + 1$ is a new word not yet present in the text; with the complementary probability $1 - \alpha$, it is chosen at random among the previous $t$ words. Since the probability of repeating a word that has already appeared is thus proportional to its number of previous occurrences, this process establishes a strong competition among different words. It can be shown that the long-time rank-frequency distribution arising from Simon's model is a power law of the form of Eq. (1) with $z = 1 - \alpha$. We have included in Fig. 1 the results of a simulation of Simon's model, with $\alpha = 0.01$. Comparison with distributions obtained from real texts reveals qualitative agreement, though quantitative differences are also apparent. In particular, Simon's model does not reproduce the faster decay at low frequencies, and is unable to explain power-law exponents larger than unity, as observed in English and Spanish texts. We show in the following that the model can be modified on the basis of linguistically sensible assumptions, in such a way as to correct such differences.
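
The rules of Simon's model described above can be sketched in a few lines of Python. This is only a minimal illustration under our own conventions: words are represented by integer labels, and the function names and the `seed` argument are hypothetical choices for the example, not part of the original formulation.

```python
import random
from collections import Counter


def simulate_simon(alpha, length, seed=0):
    """Generate a sequence of integer word labels with Simon's process."""
    rng = random.Random(seed)
    text = [0]            # the text starts with a single word at t = 1
    next_label = 1
    while len(text) < length:
        if rng.random() < alpha:
            # With probability alpha, add a word not yet present in the text.
            text.append(next_label)
            next_label += 1
        else:
            # Copying a uniformly chosen earlier position repeats each word
            # with probability proportional to its number of occurrences.
            text.append(rng.choice(text))
    return text


def rank_counts(text):
    """Occurrence counts n(r), ordered by decreasing frequency."""
    return sorted(Counter(text).values(), reverse=True)
```

A rank-frequency list for a simulated text is then obtained with, for example, `rank_counts(simulate_simon(0.01, 100000))`.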

In Simon's model, new words are introduced at a constant rate $\alpha$, such that the vocabulary size at step $t$ is, on average, $V_t = \alpha t$. Empirical data show, however, that the growth of vocabulary in real texts is typically sublinear, and may be approximated by

$$ V_t = \alpha t^{\nu}, \tag{2} $$

with $0 < \nu < 1$. This generalized form of $V_t$ corresponds to a rate of introduction of new words given by $\alpha \nu t^{\nu - 1}$. This provides the first modification to Simon's model, introducing a new parameter $\nu$. Instead of using a constant probability for the introduction of new words, we take a step-dependent probability $\alpha_0 t^{\nu - 1}$, with $\alpha_0 = \alpha \nu$. We distinguish two main factors that affect the value of $\nu$. One factor depends on author and style and may explain small differences of vocabulary growth between works in the same language. However, a stronger effect on $\nu$ results from the degree of inflection of each language: highly inflected languages like Latin, where a single root produces, through declension and conjugation, many different words, require a higher value of $\nu$ than poorly inflected languages, such as English. In a text written in a highly inflected language, word forms will proliferate significantly faster than in languages with few inflected forms.
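
The link between the step-dependent probability $\alpha_0 t^{\nu-1}$ and the sublinear growth of Eq. (2) can be checked numerically. The sketch below, with a hypothetical function name, accumulates the probability of introducing a new word at each step to obtain the mean vocabulary size, which for long texts should approach $(\alpha_0/\nu)\, t^{\nu}$. Capping the probability at one is our own safeguard for early steps and large $\alpha_0$.

```python
import numpy as np


def expected_vocabulary(alpha0, nu, length):
    """Mean vocabulary size after each of `length` steps, when a new word
    enters at step t with probability min(1, alpha0 * t**(nu - 1))."""
    t = np.arange(1, length + 1)
    p_new = np.minimum(1.0, alpha0 * t ** (nu - 1.0))
    return np.cumsum(p_new)
```

Comparing `expected_vocabulary(alpha0, nu, length)[-1]` with `alpha0 / nu * length ** nu` gives a quick consistency check of the relation $\alpha_0 = \alpha \nu$.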

FIG. 2: As in Fig. 1, the three uppermost curves are the rank-frequency distributions obtained for David Copperfield, Don Quijote, and Æneid. The fourth distribution represents a realization of Simon's model. In this case the dotted lines are realizations of the model presented in this paper with the parameters shown in Table I.

The original dynamics of Simon's model implies that, when a word has to be taken from the text written so far, the probability that word $i$ is repeated is proportional to $n_i$, the number of its previous occurrences. However, newly introduced words do not yet have a clearly defined influence on the context, and the probability that they are used again should be treated in a slightly different way. Our second modification to the model thus incorporates a word-dependent threshold $\delta_i$, in such a way that the probability of a new occurrence of word $i$ is proportional to $\max\{n_i, \delta_i\}$. The choice of this set of thresholds is to a great extent arbitrary. For simplicity we have chosen an exponential distribution, for which we just have to specify one parameter, namely the mean $\delta$. The dynamical effect of this new addition to the model is that words that have been recently introduced, for which $n_i < \delta_i$, have a slight competitive advantage and their reappearance in the text is favored. Since this parameter has no influence on words for which $n_i > \delta_i$, we expect that the power-law region of the distribution, where $n_i \gg \delta_i$, will not be affected by $\delta_i$.
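
Both modifications, the step-dependent probability of introducing new words and the word-dependent thresholds, can be combined in a short simulation. The following Python sketch is one possible reading of the rules stated above, not the code used to produce the results of this paper. The function name, the cap of the new-word probability at one, and the direct (and slow, but simple) recomputation of the selection weights at every step are assumptions of the example.

```python
import numpy as np


def simulate_modified_simon(alpha0, nu, delta_mean, length, seed=0):
    """Return the occurrence counts n_i, in decreasing order, after `length` steps."""
    rng = np.random.default_rng(seed)
    counts = [1.0]                                # n_i, one entry per distinct word
    thresholds = [rng.exponential(delta_mean)]    # delta_i, exponential with mean delta
    for t in range(1, length):
        p_new = min(1.0, alpha0 * t ** (nu - 1.0))
        if rng.random() < p_new:
            # A new word enters the text with its own threshold.
            counts.append(1.0)
            thresholds.append(rng.exponential(delta_mean))
        else:
            # An existing word i is repeated with probability
            # proportional to max{n_i, delta_i}.
            weights = np.maximum(counts, thresholds)
            i = rng.choice(len(counts), p=weights / weights.sum())
            counts[i] += 1.0
    return np.sort(np.array(counts))[::-1]
```

Each step costs a time proportional to the current vocabulary size, so this direct implementation is only practical for moderately long texts; a faster sampling scheme would be needed for texts of the length of a novel.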

In Fig. 2 we show three realizations of the model that fit quite well the empirical data already shown in Fig. 1. The slope of the distribution in the power-law region is accurately reproduced for the three literary works. Moreover, the faster decay in the low-frequency regime is also in close agreement with the actual linguistic data. Table I reports the values of the parameters used to fit the distributions.

We have performed the same analysis using other literary works, and obtained consistent results. In particular, the model reproduces Zipf distributions of texts written in languages with many inflections, such as Latin or Russian, with higher values of $\nu$ than those required to fit Zipf distributions obtained from texts written in less inflected languages such as English or Spanish. In all cases, the faster decay exhibited by the rank-frequency distributions at high ranks was obtained by choosing the parameter $\delta$ between 2 and 4.

TABLE I: Model parameters that reproduce the empirical rank-frequency distributions for the three literary works, used in the fittings of Fig. 2.

Source               $\alpha_0$   $\nu$   $\delta$
David Copperfield    0.03         0.85    3
Don Quijote          0.05         0.90    2
Æneid                0.25         0.95    3

We can analytically explain the behavior of our modified Simon's model starting from Simon's equations for the mean number $P_n(t)$ of words with $n$ occurrences at step $t$. These are

$$ P_n(t+1) = P_n(t) + \frac{1-\alpha}{t} \left[ (n-1) P_{n-1}(t) - n P_n(t) \right] \tag{3} $$

for $n > 1$, and

$$ P_1(t+1) = P_1(t) + \alpha - \frac{1-\alpha}{t} P_1(t). \tag{4} $$

Replacing $n$ and $t$ by continuous variables, $n \to y$ and $t \to t$, and $P_n(t) \to P(y,t)$, Eq. (3) can be approximated by

$$ \partial_t P + \frac{1-\alpha}{t} \partial_y (yP) = 0, \tag{5} $$

while for $P(1,t) = P_1(t)$ we have

$$ \dot{P}_1 = \alpha - \frac{1-\alpha}{t} P_1. \tag{6} $$
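
Simon's discrete equations for $P_n(t)$, with constant $\alpha$, can also be iterated directly. The Python sketch below is a minimal illustration with a hypothetical function name; it starts from a one-word text and applies the gain and loss terms for every $n$ at each step. Two simple checks follow from the equations themselves: the total number of word occurrences, $\sum_n n P_n(t)$, should equal $t$, and the mean vocabulary, $\sum_n P_n(t)$, should equal $1 + \alpha (t - 1)$.

```python
import numpy as np


def iterate_simon_equations(alpha, steps):
    """Iterate Simon's equations from t = 1 up to t = steps.

    Returns an array P with P[n] equal to the mean number of words
    that have exactly n occurrences at the final step.
    """
    P = np.zeros(steps + 2)
    P[1] = 1.0                      # at t = 1 there is one word with one occurrence
    n = np.arange(steps + 2)
    for t in range(1, steps):
        # Mean number of words promoted from n to n + 1 occurrences.
        flow = (1.0 - alpha) / t * n * P
        P[1:] += flow[:-1] - flow[1:]   # gain from n - 1, loss towards n + 1
        P[1] += alpha                   # new words enter with one occurrence
    return P
```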

Consider first the situation where the rate at which new words are added depends on $t$ as $\alpha = \alpha_0 t^{\nu-1}$ ($0 < \nu < 1$), describing a vocabulary that grows less than linearly, $V \propto t^{\nu}$. Equations (5) and (6) can be solved assuming that $\alpha \ll 1$ for all $t$. In this limit, in fact, the general solution to Eq. (5) is given by

$$ P(y,t) = \frac{1}{y} \Pi(y t^{-1}), \tag{7} $$

where $\Pi(x)$ is an arbitrary function. For the words that have appeared just once in the text, we have

$$ P_1(t) = A t^{-1} + \frac{\alpha_0 t^{\nu}}{\nu + 1}, \tag{8} $$

with $A$ an arbitrary constant. The second term dominates the large-$t$ regime. Focusing on this regime and comparing with Eq. (7), we find $\Pi(x) = \alpha_0 x^{-\nu}/(\nu+1)$ and, thus,

$$ P(y,t) = \frac{\alpha_0 t^{\nu}}{\nu+1} y^{-1-\nu}. \tag{9} $$

Taking into account that $P(y,t)$ and $n(r)$ are related according to $r = \int P \, dy$, the Zipf exponent resulting from Eq. (9) is $z = 1/\nu > 1$: integrating Eq. (9) over $y > n$ gives $r \propto n^{-\nu}$, that is, $n(r) \propto r^{-1/\nu}$. We therefore conclude that sublinear vocabulary growth, associated with a poorly inflected language, is able to explain a relatively large value of $z$, beyond the predictions of the original Simon's model.
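
The relation $z = 1/\nu$ can be verified numerically from the density $P(y,t) \propto t^{\nu} y^{-1-\nu}$ obtained above. The sketch below is an illustrative check under our own assumptions: the function names, the finite upper integration limit `y_max`, the logarithmic grid, and the range of $n$ used for the fit are all choices of the example. It integrates the density from $n$ upwards to obtain the rank $r(n)$ and then fits the slope of $\log n$ against $\log r$.

```python
import numpy as np


def rank_from_density(n_values, alpha0, nu, t, y_max=1e12, points=200001):
    """r(n): integral of P(y, t) = alpha0 t^nu y^(-1-nu) / (1 + nu) from n to y_max."""
    ranks = []
    for n in n_values:
        y = np.geomspace(n, y_max, points)
        p = alpha0 * t ** nu / (1.0 + nu) * y ** (-1.0 - nu)
        # Trapezoid rule on the logarithmically spaced grid.
        ranks.append(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(y)))
    return np.array(ranks)


def zipf_exponent(alpha0, nu, t):
    """Fit n(r) ~ r**(-z) and return z."""
    n = np.geomspace(10.0, 1e4, 20)
    r = rank_from_density(n, alpha0, nu, t)
    slope, _ = np.polyfit(np.log(r), np.log(n), 1)
    return -slope
```

The prefactor $\alpha_0 t^{\nu}/(1+\nu)$ only shifts the curve in log-log scale, so the fitted exponent depends on $\nu$ alone.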

The analytical treatment of our second modification to the model requires a preliminary discussion. In fact, the thresholds $\delta_i$ cannot be straightforwardly incorporated into Simon's equations. We can however approximate the dynamical features associated with them by introducing a few simplifications to the process. Let us first assume that all the thresholds are set to a value equal to their mean, namely $\delta$. Consider now the events in which a word must be chosen among those written so far. There are two possibilities: with probability $\gamma$ the word is chosen among those for which $n_i < \delta$, with uniform probability, or with probability $1 - \gamma$ the word is chosen among those for which $n_i > \delta$, with probability proportional to $n_i$. A kind of mean-field approximation of these dynamical rules would imply that with probability $\gamma$ the new word is chosen with uniform probability among all the previous words regardless of $n_i$, and with the complementary probability the word is chosen with probability proportional to $n_i$.
In the case of constant $\alpha$, we introduce this additive contribution to the evolution of $P_n(t)$ in the following way:

$$ P_n(t+1) = P_n(t) + \frac{(1-\alpha)(1-\gamma)}{t} \left[ (n-1) P_{n-1}(t) - n P_n(t) \right] + \frac{(1-\alpha)\gamma}{t} \left[ P_{n-1}(t) - P_n(t) \right]. \tag{10} $$

The evolution of $P_1$ is still given by Eq. (4). Particular solutions to the continuous version of Eq. (10), indexed by the parameter $\lambda$, read

$$ P(y,t) = a_\lambda t^{\lambda} \left[ (1-\gamma) y + \gamma \right]^{-1-\lambda/(1-\gamma)}, \tag{11} $$

where $a_\lambda$ is an arbitrary constant.

In the large-$t$ regime and within the assumption $\alpha \ll 1$ we can put together the effect of $t$-dependent $\alpha$, $\alpha(t) = \alpha_0 t^{\nu-1}$, and of the additive process. In fact, the large-$t$ dominant contribution in Eq. (8) is compatible with solution (11) for $\lambda = \nu$. We obtain

$$ P(y,t) = \frac{\alpha_0 t^{\nu}}{1+\nu} \left[ (1-\gamma) y + \gamma \right]^{-1-\nu/(1-\gamma)}. \tag{12} $$

The resulting form of Zipf's law is

$$ n(r) = \frac{1}{1-\gamma} \left[ (r/r_0)^{-(1-\gamma)/\nu} - \gamma \right], \tag{13} $$

with $r_0(t) = \alpha_0 t^{\nu} / [\nu (1+\nu)]$. The distribution shows a cutoff at $r = \rho r_0$, with $\rho = \gamma^{-\nu/(1-\gamma)}$, which explains its faster decay for large ranks.

In summary, we have investigated the origin of Zipf's law, which stands for the most basic statistical pattern found in written human language. Our results confirm that the rank-frequency distribution of words is mainly the consequence of multiplicative processes underlying the language generation process. Moreover, we have been able to associate the finer details of empirical distributions with two key subprocesses that participate in text production: first, a sublinear vocabulary growth law that integrates the effect of the inflective structure of language; second, context-dependent dynamics and activation thresholds related to the dynamical effect of newly introduced words. Our main conclusion is that linguistically sensible modifications of Simon's model eliminate the slight deviations between the results of the original model and actual Zipf distributions. This gives additional support to the interpretation of Zipf's law as a linguistically significant feature of written texts.
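
As a numerical companion to the rank-frequency relation of Eq. (13), the following Python sketch evaluates $n(r)$ and the cutoff rank $\rho r_0$. It assumes $r_0 = \alpha_0 t^{\nu}/[\nu(1+\nu)]$, as obtained by integrating Eq. (12) over $y > n$, and $\rho = \gamma^{-\nu/(1-\gamma)}$. The function names are hypothetical, and since no fitted value of $\gamma$ is reported here, any value passed for it is purely illustrative.

```python
import numpy as np


def occurrences_at_rank(r, alpha0, nu, gamma, t):
    """n(r) from Eq. (13); gamma is the weight of the additive process."""
    r0 = alpha0 * t ** nu / (nu * (1.0 + nu))
    return ((r / r0) ** (-(1.0 - gamma) / nu) - gamma) / (1.0 - gamma)


def cutoff_rank(alpha0, nu, gamma, t):
    """Rank rho * r0 at which n(r) in Eq. (13) reaches zero."""
    r0 = alpha0 * t ** nu / (nu * (1.0 + nu))
    return gamma ** (-nu / (1.0 - gamma)) * r0
```

For ranks well below the cutoff the constant term $\gamma$ is negligible and $n(r)$ follows a power law, while close to $r = \rho r_0$ the curve bends downwards, in line with the faster decay discussed for large ranks.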